

Appl. No. 09/703,140  
Amtd. dated March 10, 2004  
Response to Office Action of November 10, 2003

**Amendments to the Specification:**

Rewrite the paragraph at page 1, lines 8 to 10 as follows:

This application claims priority under 35 USC §119(e)(1) of Provisional Application No. 60/183,527, filed February 18, 2000 (TI-30302PS) and of Provisional Application No. 60/183,654, filed February 18, 2000 (TI-26010PS).

Rewrite the paragraph at page 6, line 16 to page 7, line 5 as follows:

In microprocessor 1 there are shown a central processing unit (CPU) 10, data memory 22, program memory 23, peripherals 60 and an external memory interface (EMIF) with a direct memory access (DMA) 61. CPU 10 further has an instruction fetch/decode unit 10a-c, a plurality of execution units, including an arithmetic and load/store unit D1, a multiplier M1, an ALU/shifter unit S1, an arithmetic logic unit ("ALU") L1, a shared multi-port register file 20a from which data are read and to which data are written. Instructions are fetched by fetch unit 10a from instruction memory 23 over a set of busses 41. Decoded instructions are provided from the instruction fetch/decode unit 10a-c to the functional units D1, M1, S1, and L1 over various sets of control lines which are not shown. Data are provided to/from the register file 20a from/to to load/store ~~units~~ unit D1 over a first set of busses 32a, to multiplier M1 over a second set of busses 34a, to ALU/shifter unit S1 over a third set of busses 36a and to ALU L1 over a fourth set of busses 38a. Data are provided to/from the memory 22 from/to the load/store ~~units~~ unit D1 via a fifth set of busses 40a. Note that the entire data path described above is duplicated with register file 20b and execution units D2, M2, S2, and L2. Load/store unit D2 similarly interfaces with memory 22 via a set of busses 40b. In this embodiment of the present invention, two unrelated aligned double word (64 bits) load/store transfers can be made in parallel between CPU 10 and data memory 22 on each clock cycle using bus set 40a and bus set 40b.

Rewrite the paragraph at page 8, lines 3 to 12 as follows:

App. No. 09/703,140  
Amtd. dated March 10, 2004  
Response to Office Action of November 10, 2003

When microprocessor 1 is incorporated in a data processing system, additional memory or peripherals may be connected to microprocessor 1, as illustrated in Figure 1. For example, Random Access Memory (RAM) 70, a Read Only Memory (ROM) 71 and a Disk 72 are shown connected via an external bus 73. Bus 73 is connected to the External Memory Interface (EMIF) which is part of functional block 61 within microprocessor 1. A Direct Memory Access (DMA) controller is also included within block 61. The DMA controller part of functional block 61 connects to data memory 22 via bus 43 and is generally used to move data between memory and peripherals within microprocessor 1 and memory and peripherals which are external to microprocessor 1.

**Rewrite the paragraph at page 8, lines 16 to 22 as follows:**

A detailed description of various architectural features of the microprocessor of Figure 1 is provided in coassigned application S.N. 09/012,813 (TI 25311) U.S. Patent 6,182,203 and is incorporated herein by reference. A description of enhanced architectural features and an extended instruction set not described herein for CPU 10 is provided in coassigned U.S. Patent application S.N. \_\_\_\_\_ (TI 30302) 09/703,096 *Microprocessor with Improved Instruction Set Architecture* and is incorporated herein by reference.

Rewrite the paragraph at page 8, line 23 to page 9, line 2 as follows:

Figure 2 is a block diagram of the execution units and register files of the microprocessor of Figure 1 and shows a more detailed view of the buses connecting the various functional blocks. In this figure, all data busses are 32 bits wide, unless otherwise noted. There are two general-purpose register files (A and B) in the processor's data paths. Each of these files contains 32 32-bit registers (A0-A31 for register file A 20a and B0-B31 for register file B 20b). The general-purpose registers can be used for data, data address pointers, or condition registers. Any number of reads of a given register can be performed in a given cycle.

Rewrite the paragraph at page 11, line 3 to page 12, line 7 as follows:

Appl. No. 09/703,140  
Amdt. dated March 10, 2004  
Response to Office Action of November 10, 2003

Most data lines in the CPU support 32-bit operands, and some support long (40-bit) and double word (64-bit) operands. Each functional unit has its own 32-bit write port into a general-purpose register file (Refer to Figure 2). All units ending in 1 (for example, .L1) write to register file A 20a and all units ending in 2 write to register file B 20b. Each functional unit has two 32-bit read ports for source operands *src1* and *src2*. Four units (.L1, .L2, .S1, and .S2) have an extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit long reads. Because each unit has its own 32-bit write port, when performing 32-bit operations all eight units can be used in parallel every cycle. Since each multiplier can return up to a 64-bit result, two write ports (*dst1* and *dst2*) are provided from the multipliers to the respective register file.

Rewrite the paragraph at page 12, lines 10 to 18 as follows:

Each functional unit reads directly from and writes directly to the register file within its own data path. That is, the .L1 unit 18a, .S1 unit 16a, .D1 unit 12a, and .M1 units unit 14a write to register file A 20a and the .L2 unit 18b, .S2 unit 16b, .D2 unit 12b, and .M2 units unit 14b write to register file B 20b. The register files are connected to the opposite-side register file's functional units via the 1X and 2X cross paths. These cross paths allow functional units from one data path to access a 32-bit operand from the opposite side's register file. The 1X cross path allows data path A's functional units to read their source from register file B. Similarly, the 2X cross path allows data path B's functional units to read their source from register file A.

Rewrite the paragraph at page 12, lines 19 to 23 as follows:

All eight of the functional units have access to the opposite side's register file via a cross path. The .M1, .M2, .S1, .S2, .D1, and .D2 units' *src2* inputs are selectable between the cross path and the same side register file. In the case of the .L1 and .L2 both *src1* and *src2* inputs are also selectable between the cross path and the same-side register file. Cross path 1X bus 210 couples one input of multiplexer 211 for src1 input of .L1 unit 18a, multiplexer 212 for src2 input of .L1 unit 18a, multiplexer 213 for src2 input of .S1 unit 16a and multiplexer 214 for src2

Appl. No. 09/703,140  
Amdt. dated March 10, 2004  
Response to Office Action of November 10, 2003

input of M1 unit 14a. Multiplexers 211, 212, 213, and 214 select between the cross path 1X bus 210 and an output of register file A 20a. Buffer 250 buffers cross path 2X output to similar multiplexers for L2, S2, M2, and D2 units.

Insert the following new paragraph at page 13, between lines 9 to 10:

S2 unit 16b may write to control register file 102 from its dst output via bus 220. S2 unit 16b may read from control register file 102 to its src2 input via bus 221.

Rewrite the paragraph at page 13, line 24 to page 14, line 4 as follows:

Bus 40a has an address bus DA1 which is driven by mux 200a. This allows an address generated by either load/store unit D1 or D2 to provide a memory address for loads or stores for register file 20a. Data Bus LD1 loads data from an address in memory 22 specified by address bus DA1 to a register in load unit D1. Unit D1 may manipulate the data provided prior to storing it in register file 20a. Likewise, data bus ST1 stores data from register file 20a to memory 22. Load/store unit D1 performs the following operations: 32-bit add, subtract, linear and circular address calculations. Load/store unit D2 operates similarly to unit D1 via bus 40b, with the assistance of mux 200b for selecting an address.

Rewrite the paragraph at page 14, lines 15 to 23 as follows:

Table 3 defines the mapping between instructions and functional units for a set of basic instructions included in a DSP described in U.S. Patent S.N. 09/012,813 (TI-25311, 6,182,203 incorporated herein by reference). Table 4 defines a mapping between instructions and functional units for a set of extended instructions in an embodiment of the present invention. Alternative embodiments of the present invention may have different sets of instructions and functional unit mapping. Table 3 and Table 4 are illustrative and are not exhaustive or intended to limit various embodiments of the present invention.

Appl. No. 09/703,140  
Amdt. dated March 10, 2004  
Response to Office Action of November 10, 2003

Rewrite the paragraph at page 19, lines 12 to 18 as follows:

Performance can be inhibited by stalls from the memory system, stalls for cross path dependencies, or interrupts. The reasons for memory stalls are determined by the memory architecture. Cross path stalls are described in detail in U.S. Patent S.N. ~~\_\_\_\_\_ (TI-30563)~~ 09/702,453, to Steiss, et al and is incorporated herein by reference. To fully understand how to optimize a program for speed, the sequence of program fetch, data store, and data load requests the program makes, and how they might stall the CPU should be understood.

Rewrite the paragraph at page 27, line 25 as follows:

The remaining steps 620, 630 and ~~64~~ 640 are identical to Figure 6A.

Rewrite the paragraph at page 28, lines 5 to 27 as follows:

The M unit has three major functional units: Galois multiply unit 700a-c, multiply unit 710 and other non-multiply functional circuitry in block 720. Galois multiplier 700a-c and multiplier 710 require three additional cycles to complete the multiply operations, so multiply instructions are categorized as having three delay slots. Pipeline registers 730-733 hold partial results between each pipeline execution phase. In general, multiply unit 710 can perform the following operations on a pair of multipliers 711a,b: two 16x16 multiplies or four 8x8 multiplies with all combination of signed or unsigned numbers, Q-shifting and P-shifting of multiply results, rounding for multiply instructions, controlling the carry chain by breaking/joining the carry chain at 16-bit block boundaries, and saturation multiplication where the final result is shifted left by 1 or returns 0x7FFFFFFF if an overflow occurs. Galois multiply unit 700 performs Galois multiply in parallel with M multiply unit 710. The lower 32 bits (bits 31:0) of a result are selected by multiplexer 734 and are stored in the even register of a register pair. The upper 32 bits (bits 63:33) of the result are selected by multiplexer 735 and are stored in the odd register of the register pair. A more detailed description of configurable multiply circuitry is provided in co-assigned U.S. Patent application S.N. ~~\_\_\_\_\_ (TI-26010)~~ 09/703,093 entitled

Appl. No. 09/703,140  
Amtd. dated March 10, 2004  
Response to Office Action of November 10, 2003

*Data Processor With Flexible Multiply Unit* and is incorporated herein by reference. Details of the Galois multiply unit are provided in co-assigned U.S. Patent application S.N. \_\_\_\_\_ (T.I. 26013) 09/507,187 to David Hoyle entitled *Galois Field Multiply* and is incorporated herein by reference.